{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "df_2019_jul = pd.read_csv('../data/nyc_taxi_2019-07.csv',\n",
    "                usecols=['passenger_count', \n",
    "                        'total_amount', 'payment_type'])\n",
    "df_2019_jul['year'] = 2019\n",
    "\n",
    "df_2020_jul = pd.read_csv('../data/nyc_taxi_2020-07.csv',\n",
    "                usecols=['passenger_count', \n",
    "                        'total_amount', 'payment_type'])\n",
    "df_2020_jul['year'] = 2020\n",
    "\n",
    "df = pd.concat([df_2019_jul, df_2020_jul])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 1\n",
    "\n",
    "Use the `corr` method on `df` to find the correlations among the columns. How would you interpret these results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>passenger_count</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.016410</td>\n",
       "      <td>0.014943</td>\n",
       "      <td>-0.049558</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>payment_type</th>\n",
       "      <td>0.016410</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.138561</td>\n",
       "      <td>0.029277</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>total_amount</th>\n",
       "      <td>0.014943</td>\n",
       "      <td>-0.138561</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.019706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <td>-0.049558</td>\n",
       "      <td>0.029277</td>\n",
       "      <td>-0.019706</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 passenger_count  payment_type  total_amount      year\n",
       "passenger_count         1.000000      0.016410      0.014943 -0.049558\n",
       "payment_type            0.016410      1.000000     -0.138561  0.029277\n",
       "total_amount            0.014943     -0.138561      1.000000 -0.019706\n",
       "year                   -0.049558      0.029277     -0.019706  1.000000"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is no significant correlation among any of these columns. The closest that we see is between payment type and total amount — but even there, it's an extremely small, negative correlation. \n",
    "\n",
    "(The payment type is actually categorical, so saying that there's a numeric correlation is a bit weird in any event. But given that 1 is credit and 2 is cash, a positive correlation would mean that a higher fare would be more likely to use cash. A negative correlation thus means that a higher fare is *less* likely to be paid in cash, which matches our common sense. But again, the correlation is extremely weak.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 2\n",
    "\n",
    "Show, with a single command, the difference in descriptive statistics for `total_amount` beween 2019 and 2020. Round values to use no more than 2 digits after the decimal point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count   -5510007.00\n",
       "mean          -0.98\n",
       "std           -0.75\n",
       "min           53.20\n",
       "25%           -0.50\n",
       "50%           -0.60\n",
       "75%           -0.75\n",
       "max        -4672.45\n",
       "Name: total_amount, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    df.loc[df['year'] == 2020, 'total_amount'].describe().round(2) - \n",
    "    df.loc[df['year'] == 2019, 'total_amount'].describe().round(2)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 3\n",
    "\n",
    "If we assume that zero-passenger trips are for delivering packages, how were those affected during the pandemic? Show the proportion of such trips in 2019 vs. 2020."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.01666432611802781"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[(df['year'] == 2019) & \n",
    "       (df['passenger_count'] == 0), 'passenger_count'].count() / df['passenger_count'].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0027809994974354953"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[(df['year'] == 2020) & \n",
    "       (df['passenger_count'] == 0), 'passenger_count'].count() / df['passenger_count'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Turns out that the proportion of zero-person taxi rides shrank as well, from about 1.6% in 2019 to 0.2% in 2020. Honestly, I expected it to be higher during the pandemic!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
